What Are the Determining Factors of GPA?

Introduction:

Students all over the world are measured by one value when applying for jobs to begin their post-academic careers: Grade Point Average, or GPA. This value matters because it is thought to provide insight not only into a student’s aptitude, but also into their work ethic and ability to get tasks and assignments completed. Because GPA carries so much weight, it is important to analyze the factors that may affect a student’s GPA. The provided dataset, food, describes 125 students and specific details about their lives, including GPA, income, gender, what they eat in a day, parental education, and whether or not the student drinks coffee daily. The full dataset contains 125 rows and 61 columns of information that can be used to assess the factors behind a student’s GPA.

This paper examines the relationship between GPA and three factors: student income, the father’s educational level, and the student’s perception of an ideal diet. It also considers which of these factors is most strongly related to GPA. To answer the research question, "Is GPA related to student income, the father’s educational level, or the student’s perception of what an ideal diet is?", the following columns are analyzed: GPA <chr>, income <dbl>, father_education <dbl>, and ideal_diet_coded <dbl>.

Approach:

The food dataset was initially uncleaned and needed to be assessed and reshaped before the analysis could be completed. A preliminary review of the data showed that not all of the columns were correctly typed; in particular, GPA was stored as text rather than as a number. This required cleaning and a refined version of the given dataframe. To analyze the students’ GPA, boxplots are generated with geom_boxplot() to examine the relationship between GPA and each of the following variables: income, father_education, and ideal_diet_coded. Boxplots provide a visual summary of the data, enabling identification of the median, the dispersion of the data (the interquartile range and whiskers), and signs of skewness.
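As a brief aside, the quantities a boxplot draws can be computed directly. The sketch below uses a small made-up vector of GPA values (hypothetical, not taken from the food dataset) and base R's median(), quantile(), and IQR(). By default, geom_boxplot() draws the box from the first to the third quartile, marks the median inside it, and extends the whiskers to the most extreme points within 1.5 × IQR of the box.

```r
# Hypothetical GPA values, for illustration only (not from the food dataset)
gpa_example <- c(2.2, 3.0, 3.2, 3.5, 3.7, 4.0)

box_median <- median(gpa_example)                 # line inside the box
box_q1     <- unname(quantile(gpa_example, 0.25)) # lower edge of the box
box_q3     <- unname(quantile(gpa_example, 0.75)) # upper edge of the box
box_iqr    <- IQR(gpa_example)                    # height of the box

# Whiskers reach the most extreme observations within 1.5 * IQR of the box
lower_whisker <- min(gpa_example[gpa_example >= box_q1 - 1.5 * box_iqr])
upper_whisker <- max(gpa_example[gpa_example <= box_q3 + 1.5 * box_iqr])

# A median sitting closer to one edge of the box than the other hints at skewness
c(median = box_median, q1 = box_q1, q3 = box_q3, iqr = box_iqr,
  lower = lower_whisker, upper = upper_whisker)
```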

The table below shows the first six rows of the refined, but not yet cleaned, food dataframe:

food_refined <- select(food, GPA, income, father_education, ideal_diet_coded)
head(food_refined)

## # A tibble: 6 × 4
##   GPA   income father_education ideal_diet_coded
##   <chr>  <dbl>            <dbl>            <dbl>
## 1 2.4        5                5                8
## 2 3.654      4                2                3
## 3 3.3        6                2                6
## 4 3.2        6                2                2
## 5 3.5        6                4                2
## 6 2.25       1                1                2

Since not all of the cells in the GPA column were numeric, the separate() function was used to split each value at the point where a digit is followed by text, keeping only the leading numeric part. The cells that still contained only text were then converted to NA using replace_with_na_at(), and the column was converted to numeric with as.numeric(). All of the categorical variables used in this smaller dataframe (i.e. income, father_education, and ideal_diet_coded) were encoded as numerical values. Because of this, the case_when() function was used to attach a descriptive label to each numerical code. These labels were taken from the detailed data dictionary that accompanies this dataset.
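To make the GPA cleaning step concrete, here is a minimal base R sketch of the same idea on a few made-up entries. The example strings and the helper name parse_gpa are hypothetical, and the sketch uses sub() and grepl() rather than the separate() and replace_with_na_at() calls used in this analysis.

```r
# Hypothetical raw entries, for illustration only
raw_gpa <- c("3.654", "3.5 approx", "Unknown", NA)

# Hypothetical helper: keep a leading number, turn everything else into NA
parse_gpa <- function(x) {
  has_number <- grepl("^[0-9]+(\\.[0-9]+)?", x)           # FALSE for text-only and NA entries
  leading    <- sub("^([0-9]+(\\.[0-9]+)?).*$", "\\1", x) # strip any trailing text
  as.numeric(ifelse(has_number, leading, NA))
}

parse_gpa(raw_gpa)
```

Entries that start with a number keep that number, while text-only and missing entries end up as NA, which mirrors what the cleaned GPA column is meant to contain.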

Analysis:

The chunk of code below shows the data wrangling and cleaning required to complete the analysis. The first six rows of the cleaned dataframe are shown.

library(tidyverse) # select(), separate(), mutate(), case_when(), fct_relevel(), ggplot()
library(naniar)    # replace_with_na_at()

food_refined <- select(food, GPA, income, father_education, ideal_diet_coded) %>% # keep only the columns under study
  # split GPA where a digit is followed by text; keep only the numeric part
  separate(GPA, into = c("GPA"), sep = "(?=[a-z +]+)(?<=[0-9])", extra = "drop") %>%
  # set GPA entries that are still not purely numeric (text-only answers) to NA
  replace_with_na_at("GPA", ~ !grepl("^[0-9.]+$", .x)) %>%
  mutate(GPA = as.numeric(GPA)) %>% # convert the GPA column from character to numeric
  mutate(
    ideal_diet_coded = case_when( # labelling the ideal diet codes
      ideal_diet_coded == 1 ~ "1. portion control",
      ideal_diet_coded == 2 ~ "2. adding fruits/veggies",
      ideal_diet_coded == 3 ~ "3. balance",
      ideal_diet_coded == 4 ~ "4. less sugar",
      ideal_diet_coded == 5 ~ "5. home cooked/organic",
      ideal_diet_coded == 6 ~ "6. current diet",
      ideal_diet_coded == 7 ~ "7. more protein",
      ideal_diet_coded == 8 ~ "8. unclear",
      TRUE ~ NA_character_ # missing or unrecognized codes stay NA
    )
  ) %>%
  mutate(
    income = case_when( # labelling the income codes
      income == 1 ~ "1. < $15,000",
      income == 2 ~ "2. $15,001 to $30,000",
      income == 3 ~ "3. $30,001 to $50,000",
      income == 4 ~ "4. $50,001 to $70,000",
      income == 5 ~ "5. $70,001 to $100,000",
      income == 6 ~ "6. > $100,000",
      TRUE ~ NA_character_ # missing or unrecognized codes stay NA
    )
  ) %>%
  mutate(
    father_education = case_when( # labelling the father's education codes
      father_education == 1 ~ "1. < high school",
      father_education == 2 ~ "2. high school degree",
      father_education == 3 ~ "3. some college",
      father_education == 4 ~ "4. college degree",
      father_education == 5 ~ "5. graduate degree",
      TRUE ~ NA_character_ # missing or unrecognized codes stay NA
    )
  ) # rows with a missing GPA are kept here; each plot below drops them with na.omit()
head(food_refined)

## # A tibble: 6 × 4
##     GPA income                 father_education      ideal_diet_coded        
##   <dbl> <chr>                  <chr>                 <chr>                   
## 1  2.4  5. $70,001 to $100,000 5. graduate degree    8. unclear              
## 2  3.65 4. $50,001 to $70,000  2. high school degree 3. balance              
## 3  3.3  6. > $100,000          2. high school degree 6. current diet         
## 4  3.2  6. > $100,000          2. high school degree 2. adding fruits/veggies
## 5  3.5  6. > $100,000          4. college degree     2. adding fruits/veggies
## 6  2.25 1. < $15,000           1. < high school      2. adding fruits/veggies

The summary and tables are shown below:

Summary of the GPA data:

summary(food_refined$GPA)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.200   3.200   3.500   3.419   3.700   4.000       4

Table of the students’ income:

table(fct_relevel(food_refined$income))

## 
##           1. < $15,000  2. $15,001 to $30,000  3. $30,001 to $50,000 
##                      6                      7                     17 
##  4. $50,001 to $70,000 5. $70,001 to $100,000          6. > $100,000 
##                     20                     33                     41

Table of the education level of the students’ fathers:

table(fct_relevel(food_refined$father_education))

## 
##      1. < high school 2. high school degree       3. some college 
##                     4                    34                    12 
##     4. college degree    5. graduate degree 
##                    46                    28

Table of the students’ perception of an ideal diet:

table(fct_relevel(food_refined$ideal_diet_coded))

## 
##       1. portion control 2. adding fruits/veggies               3. balance 
##                       11                       44                       17 
##            4. less sugar   5. home cooked/organic          6. current diet 
##                        6                       15                       13 
##          7. more protein               8. unclear 
##                       16                        3

GPA Related to Student Income:

ggplot(
  na.omit(food_refined[c("income","GPA")]), 
  aes(fct_relevel(income), GPA, fill = factor(income))
  ) +
  geom_boxplot(
    alpha = 0.75
  ) + 
  labs( #adding labels
    title = "GPA by Student Income",
    x = "Student Income",
    y = "GPA"
    ) +
  scale_fill_brewer(
    name = "",
    palette = "Set3" #custom palette
    ) +
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = "top",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.text.x = element_blank() # hide x-axis tick labels; the legend identifies each group
    )

GPA Related to Father’s Education Level:

ggplot(
  na.omit(food_refined[c("father_education","GPA")]), 
  aes(fct_relevel(father_education), GPA, fill = factor(father_education))
  ) +
  geom_boxplot(
    alpha = 0.75
  ) + 
  labs( #adding labels
    title = "GPA by Father's Education Level",
    x = "Father's Education Level",
    y = "GPA"
    ) +
  scale_fill_brewer(
    name = "",
    palette = "Set3" #custom palette
    ) +
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = "top",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.text.x = element_blank() # hide x-axis tick labels; the legend identifies each group
    )

GPA Related to Students’ Perception of an Ideal Diet:

ggplot(
  na.omit(food_refined[c("ideal_diet_coded","GPA")]), 
  aes(fct_relevel(ideal_diet_coded), GPA, fill = factor(ideal_diet_coded))
  ) +
  geom_boxplot(
    alpha = 0.75
  ) + 
  labs( #adding labels
    title = "GPA by Ideal Diet",
    x = "Ideal Diet",
    y = "GPA"
    ) +
  scale_fill_brewer(
    name = "",
    palette = "Set3" #custom palette
    ) +
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = "top",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.text.x = element_blank() # hide x-axis tick labels; the legend identifies each group
    )

Discussion:

Based on the boxplots above, of the three variables income, father_education, and ideal_diet_coded, GPA appears to be related to ideal diet and father’s education.

Students’ perception of an ideal diet showed some variation in GPA. The boxplot of ideal_diet_coded and GPA shows that students who were unclear about what an ideal diet would be had a noticeably lower GPA and higher dispersion than those who answered ‘portion control’ or ‘less sugar’. Students who answered ‘less sugar’ had the highest GPA and the lowest dispersion. It is important to note that ‘less sugar’ (6 students) and ‘unclear’ (3 students) were the least frequent answers in the ideal_diet_coded column; therefore, these results may not be reliable.
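Readings like these can be checked numerically with a per-group summary. The sketch below is an illustration only: the data frame toy_diet and its values are made up, and with the real data the same idea would be applied to the GPA and ideal_diet_coded columns of food_refined.

```r
# Made-up data, for illustration only (not from the food dataset)
toy_diet <- data.frame(
  ideal_diet = c("4. less sugar", "4. less sugar", "8. unclear",
                 "8. unclear", "8. unclear", "3. balance"),
  GPA        = c(3.8, 3.9, 2.4, 3.0, 3.6, 3.5)
)

# Count, median, and interquartile range of GPA within each answer group
gpa_by_group  <- split(toy_diet$GPA, toy_diet$ideal_diet)
group_summary <- data.frame(
  n      = sapply(gpa_by_group, length),
  median = sapply(gpa_by_group, median),
  iqr    = sapply(gpa_by_group, IQR)
)

group_summary # a small n means the median and IQR rest on very few students
```

Reporting n next to each median makes it easier to see which group comparisons rest on only a handful of observations.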

There is generally more variation in GPA across the levels of father_education. It makes sense that students whose father has less than a high school education have a lower typical GPA than students whose father completed a college degree. This observation carries some weight because a relatively high number of students answered ‘college degree’ (46) and ‘graduate degree’ (28), although only 4 students reported a father with less than a high school education.
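Boxplots alone cannot show whether a difference between groups is statistically significant. One option, not used in this analysis, would be a Kruskal-Wallis rank-sum test via base R's kruskal.test(). The sketch below runs it on made-up data; toy_edu and its values are hypothetical, so the resulting p-value says nothing about the food dataset.

```r
# Made-up data, for illustration only (not from the food dataset)
toy_edu <- data.frame(
  father_education = rep(c("2. high school degree", "4. college degree",
                           "5. graduate degree"), each = 4),
  GPA = c(3.0, 3.2, 3.4, 3.1,
          3.5, 3.6, 3.8, 3.3,
          3.2, 3.7, 3.4, 3.6)
)

# Null hypothesis: GPA has the same distribution in every education group
kw <- kruskal.test(GPA ~ father_education, data = toy_edu)

kw$p.value # a small p-value would suggest that at least one group differs
```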

The conclusion is not solely Father’s Education because of the results for students whose fathers have a graduate degree. These results are counterintuitive: one would expect that the resources that come with having a parent with such a high level of education would help a student achieve a high GPA. There is no noticeable relationship between student income and GPA. The analysis performed in this paper shows that it is important to examine the different factors that may contribute to a student’s GPA, rather than associating this value only with a student’s aptitude, work ethic, and ability to get tasks and assignments completed.